Abstract



This paper presents the results of a simulation-based study of various translation lookaside 
buffer (TLB) architectures, in the context of a modern VLSI RISC processor. The simulators 
used address traces, generated by instrumented versions of the SPECmarks and several other 
programs running on a DECstation 5000. The performance of two-level TLBs and
fully-associative TLBs was investigated. The amount of memory mapped was found to be the
dominant factor in TLB performance. Small first-level FIFO instruction TLBs can be effective
in two-level TLB configurations. For some applications, the cycles-per-instruction (CPI) loss
due to TLB misses can be reduced from as much as 5 CPI to negligible levels with typical TLB 
parameters through the use of variable-sized pages. 

1. Introduction

In computer systems with virtual memory, a TLB is typically used to provide fast translation 
of virtual addresses generated by instruction execution to physical addresses needed for cache 
tag comparisons. Both physically and virtually addressed caches require address translation. 
With physically addressed caches, the TLB lookup is in the critical path of cache access, so low 
latency and miss rates are crucial for memory system performance. The TLB is a cache, speed- 
ing up access to entries in the page table, where complete information on virtual to physical 
memory mappings is maintained. Most modern machines use split instruction and data caches, 
and this configuration is assumed (unless stated otherwise) in the remainder of this paper. Given 
this context, we consider several possibilities for the TLB implementation: 

• A single TLB shared between instruction and data caches. To reduce conten- 
tion, the TLB can be dual-ported. This introduces complex circuitry, doubling the 
size of the TLB without increasing its capacity. 

• Independent TLBs for instruction and data caches. The instruction TLB should 
be made smaller than the data TLB, as instruction reference streams exhibit greater 
locality than those for data. The appropriate size tradeoff is difficult to determine 
and once made is fixed. If the instruction TLB is too small, performance will suffer. 
If it is too large, the space available for the data TLB is compromised and again 
performance suffers. 

• Two-level TLB architectures. A small instruction TLB (i.e., a micro-TLB) can be
refilled from a larger single-ported shared TLB, primarily used for data references.
This option is described in more detail below. 

A micro-TLB is a fully associative TLB with a very small number of entries (probably fewer
than eight) which is reloaded in hardware from a larger shared TLB. A number of recent
machines use micro-TLBs, including the MIPS R4000 [7], though they are invisible at the
architecture level. A micro-TLB is accessed in parallel with the instruction cache. On a miss, the
micro-TLB is reloaded from the shared TLB. As the larger TLB is single ported, the CPU may 
stall for a few cycles, with data references suspended while the TLB is busy, but this penalty is 
much less expensive than that of a full TLB miss. We assume 3 cycles as the micro-TLB miss 
penalty in this paper. Because of the high locality in instruction reference streams, and the rela- 
tively small miss penalty, acceptable miss rates can be achieved with a small micro-TLB. The 
balance of instruction and data entries in the shared TLB is determined dynamically, unlike in 
the second option above. In addition the shared TLB need not be dual-ported, so the extra space 
can be used to increase its capacity. No previous research known to the authors characterizes the 
design space for the micro-TLB. 
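
The refill path described above can be illustrated with a short simulation sketch. This is not the simulator used in this study. The class and function names are hypothetical, and the LRU policy of the shared TLB and the 100 cycle full-miss penalty are assumptions made for illustration; only the 3 cycle micro-TLB miss penalty is taken from the text.

```python
from collections import OrderedDict, deque


class FifoMicroTlb:
    """Tiny fully associative micro-TLB with FIFO replacement (hypothetical model)."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = deque()  # oldest-loaded page sits at the left

    def lookup(self, vpn):
        return vpn in self.pages

    def refill(self, vpn):
        if len(self.pages) == self.entries:
            self.pages.popleft()  # evict the entry that was loaded first
        self.pages.append(vpn)


class SharedTlb:
    """Larger fully associative shared TLB; LRU is an assumption of this sketch."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # least recently used page sits at the front

    def access(self, vpn):
        """Return True on a hit; on a miss, load the page and return False."""
        if vpn in self.pages:
            self.pages.move_to_end(vpn)
            return True
        if len(self.pages) == self.entries:
            self.pages.popitem(last=False)
        self.pages[vpn] = True
        return False


def instruction_stall_cycles(addresses, micro, shared, page_bytes=4096,
                             micro_penalty=3, full_penalty=100):
    """Sum the stall cycles caused by TLB misses on an instruction address stream."""
    cycles = 0
    for addr in addresses:
        vpn = addr // page_bytes
        if micro.lookup(vpn):
            continue
        cycles += micro_penalty        # micro-TLB reload from the shared TLB
        if not shared.access(vpn):
            cycles += full_penalty     # the shared TLB missed as well (assumed cost)
        micro.refill(vpn)
    return cycles


# Instruction pages touched: 0, 1, 0, 2, 1, 0
trace = [0x0000, 0x1000, 0x0004, 0x2000, 0x1008, 0x0008]
print(instruction_stall_cycles(trace, FifoMicroTlb(2), SharedTlb(64)))
```

In this toy trace the final reference to page 0 misses in the two-entry micro-TLB but hits in the shared TLB, so it costs only the small reload penalty.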

Because the TLB can be in the critical path of memory access, good TLB performance is 
essential to good overall performance of a machine. TLB design has been complicated in several 
recent architectures with split instruction and data TLBs. To date, such designs have received 
negligible attention in the research literature. Experimental results are presented herein to 
characterize the behavior of split as well as shared TLBs. 

Another feature found in several recent architectures is TLB entries that can map variable size 
pages. When such a TLB entry is loaded with a new mapping, it is also loaded with the size of 
the page to be mapped. Typically, the size is restricted to a power of two and may range from 
4K bytes to a gigabyte [5, 7]. Although there are several obvious applications of variable size
pages, such as mapping operating system text and graphics frame buffers, it is not yet understood 
to what degree they can be used to improve the execution rate of application code. Performance 
of applications which concentrate references on a contiguous segment could improve if that seg- 
ment were accessed as a single page with a single TLB entry. Applications which scatter 
references across a sparse address space have little hope of benefiting from large pages without 
significantly increased memory usage. Address traces and reference counting tools can be used 
to record dynamic patterns of memory access, to aid in understanding the applicability of these 
structures. 
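
As an illustration of how a TLB entry that carries its own page size might be matched, consider the following sketch. The entry layout and all names are hypothetical and do not describe the R4000 or any other particular machine; the only property taken from the text is that the page size is a power of two stored with the mapping.

```python
from dataclasses import dataclass


@dataclass
class VarTlbEntry:
    virt_base: int   # virtual base address, aligned to page_bytes
    phys_base: int   # physical base address, aligned to page_bytes
    page_bytes: int  # page size in bytes, a power of two


def translate(entries, vaddr):
    """Return the physical address for vaddr, or None on a TLB miss."""
    for entry in entries:
        offset_mask = entry.page_bytes - 1
        # The number of address bits compared depends on the entry's page size.
        if (vaddr & ~offset_mask) == entry.virt_base:
            return entry.phys_base | (vaddr & offset_mask)
    return None


tlb = [
    VarTlbEntry(virt_base=0x10000000, phys_base=0x02000000, page_bytes=4 << 20),
    VarTlbEntry(virt_base=0x00400000, phys_base=0x00080000, page_bytes=4 << 10),
]
print(hex(translate(tlb, 0x10012345)))  # falls inside the single 4 megabyte entry
print(translate(tlb, 0x00401000))       # not mapped by either entry
```

A contiguous segment that fits within the large page is covered by one entry, whereas 4K byte pages would need one entry per page touched.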

The many recent studies on memory system behavior and performance have concentrated al- 
most exclusively on cache design [10, 9]. Little attention has been given to TLB performance. 

Early studies have shown that TLB miss penalties consume 6% of all machine cycles [4] and 4% 
of execution time [3], and hence can have a significant impact on machine performance. 
However, these results were for VAX computers with 512 byte page sizes, an order of magnitude 
smaller than is typical today, and main memory sizes two orders of magnitude smaller than those 
considered in this study. 

Wood [12, 11] proposed in-cache address translation as an alternative to a TLB. His work has 
shown that such methods are effective for programs such as Lisp applications and operating sys- 
tems, where the working set is spread over a large address space. They are less useful for be- 
havior such as is seen with typical C programs, where memory activity is concentrated in the 
bottom of several segments. His methods also become less applicable when memory access 
times become large with respect to processor speed. 

Finally, previous TLB studies [3, 11] have considered set-associative or direct-mapped or- 
ganizations. These were common when TLBs were made from discrete MSI and LSI RAMs. 
Recently, however, VLSI RISC microprocessors (e.g., [5, 7]) typically make use of fully- 
associative TLBs, since these require about the same area as set-associative TLBs when im- 
plemented within a VLSI chip. Thus the TLB implementations studied in this paper are all fully- 
associative designs. 

This paper is concerned with TLB performance. Understanding the relation between TLB 
performance and overall machine performance is a different question, involving the balance of 
compulsory to capacity misses and the relative ability of caches and TLBs to map multiple 
localities. The first access to a page results in a compulsory TLB miss. In this situation, cache
misses also occur, the cost of which might overshadow the TLB penalties. Other TLB misses
are capacity misses, when a program returns to a locality that has been replaced out of the TLB,
though it might still be present in the cache. This phenomenon becomes more common as cache
sizes increase. In this paper, CPI effects of TLB performance are discussed assuming perfect 
memory system performance. This can be deceptive, as it trivializes the question of cache per- 
formance. More exact results require simulation of a more complete memory system. 

The remainder of the paper is structured as follows. First we give some brief details on our 
methodology. This is followed by a discussion of instruction TLB behavior, both for micro- 
TLBs and for independent instruction TLBs. Data and shared TLB results are presented next, 
followed by a discussion of variable size pages, and finally some concluding notes. 

2. Methodology

The tracing and simulations were done on a DECstation 5000, using an updated version of the 
WRL address tracing tools [2]. Application programs are instrumented by inserting code at the 
start of each basic block and before each load or store instruction to make entries in a per-process 
trace buffer. The contents of the per-process buffer are appended to a system trace buffer each 
time the kernel is invoked, and whenever the per-process buffer becomes full. The kernel makes 
additional entries in the system buffer to record context switches, system calls, and other such 
events. When the system buffer becomes full, traced processes are suspended and an analysis 
program, in this case a TLB simulator, is started. The analysis program runs until the contents of 
the kernel buffer have been consumed, at which time tracing may continue. The system permits 
multi-process traces and on-the-fly analysis of trace data with minimum distortion to the traces. 
System references are not included in the traces. 

In simulating the various TLB configurations, a variety of workloads were used, including 
each of the SPECmarks, plus a number of other workloads meant to anticipate more realistic 
future workloads. Tree is a recursive, data intensive benchmark written in C-Scheme [1]. Magic 
[8] is a VLSI layout tool. In this run it was extracting the MultiTitan CPU chip [6]. A multi- 
tasking mix was also used, running the following programs: 

• gcc

• magic extracting the MultiTitan CPU chip

• ld loading magic

• tree with a 10 megabyte heap

• a loop running the shell programs cp, cat, sed, ls, ps, and rm.

Short running programs were put in loops, so that their execution would continue throughout the 
entire run. This mix is meant to be comparable to the mix used in previous trace-based studies 
by Borg et al. [2].

3. Instruction TLB Results

Instruction reference streams place lesser demands on TLB resources than data reference 
streams. Instruction references generally exhibit higher locality, both spatial and temporal. Also 
there is generally less memory involved. The largest text segment of the SPECmarks, when 
compiled for the DECstation, is the Gnu C compiler gcc with 688K bytes. The average text 
segment size is around 200K bytes. Data segments are frequently much larger. Data references 
for nasa7, a benchmark for numeric computation, range over a space of over three megabytes. 

Because of the different performance characteristics of TLBs and micro-TLBs, they are dis- 
cussed separately. 

3.1. Micro-TLBs 

Micro-TLBs were simulated with sizes varying from one to eight entries and Least Recently 
Used (LRU) or Least Recently Replaced (FIFO) replacement algorithms. Page sizes of 4K and 
16K bytes were used for the simulations. 

Figure 1 is a plot of the simulation results for the eqntott benchmark, illustrative of micro-TLB
behavior. The number of entries in the micro-TLB varies along the x axis. The y axis is scaled
in instructions per miss, the reciprocal of the miss rate. Plots of instructions per miss are used
because they illustrate interesting behavior more clearly than plots of miss rate, with data points
corresponding to good performance toward the top of the graph, poor performance toward the
bottom, and points spaced in a meaningful way, rather than disappearing along the x axis.

Figure 1: Eqntott micro-TLB behavior

FIFO replacement, 4K byte pages 

There are three general regimes of behavior to be observed. Notice the flatness of the curve to
the left side of the graph. In this region, thrashing is occurring. The number of micro-TLB
entries is well under the number of instruction pages in the working set of the program, and 
consequently micro-TLB performance is relatively poor: 442 instructions per miss with two 
micro-TLB lines, 1490 instructions per miss with three. 

After the flat part of the curve comes a region where performance improves rapidly. For 
eqntott this occurs between three and four micro-TLB entries. Lastly comes another relatively 
flat region, where the micro-TLB has enough entries to map the entire working set of the 
program. In this region, additional micro-TLB entries do little to improve performance. 

Figure 2 illustrates micro-TLB performance for the SPECmarks. Notice a log scale is used for
the y axis of this graph, while the y axis in Figure 1 used a linear scale. Although the log scale
tends to obscure the different domains of behavior, it makes it possible to compare all the
SPECmarks on the same graph. If a micro-TLB miss penalty of 3 cycles is assumed and the average
number of instructions per miss is 333 or less, then at least about one machine cycle per one hundred
instructions (i.e., about 0.01 cycles per instruction, or CPI) is lost to micro-TLB misses. With one
micro-TLB entry, more than half the SPECmarks have this much of a penalty. With a two entry
micro-TLB, 40% are at this penalty level.
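
The arithmetic behind these estimates is a single division or multiplication. The helpers below are a hypothetical sketch, not part of the tools used in this study; they treat the miss penalty as a fixed number of cycles and ignore all other memory system effects.

```python
def cpi_loss(miss_penalty_cycles, instructions_per_miss):
    """CPI lost to TLB misses, given the average number of instructions per miss."""
    return miss_penalty_cycles / instructions_per_miss


def cpi_loss_from_miss_rate(miss_penalty_cycles, misses_per_instruction):
    """The same quantity, starting from a miss rate (misses per instruction)."""
    return miss_penalty_cycles * misses_per_instruction


# 3 cycle micro-TLB penalty at the 333 instructions-per-miss threshold
print(round(cpi_loss(3, 333), 4))
# eqntott with a two entry micro-TLB: 442 instructions per miss
print(round(cpi_loss(3, 442), 4))
```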

Figure 2: SPECmark micro-TLB Behavior

FIFO replacement, 4K byte pages 

Four of the SPECmarks have particularly mediocre micro-TLB performance. Fpppp and 
doduc are both floating-point benchmarks. Both of them are noted for their poor instruction 



A Simulation-Based Study of TLB Performance 



cache performance, predictive of the observed micro-TLB behavior. The miss rate for fpppp for 
a simulated 4K byte instruction cache with 16 byte lines is 23%, due to its particularly long basic 
blocks (an average of 130 instructions per branch for the entire run). Computation in doduc is 
spread over a large number of procedures, and has a simulated 4K byte I-cache miss rate of 11%. 

Two language processing programs, gcc, the Gnu C compiler, and li, a Lisp interpreter, have
the worst micro-TLB performance. The I-cache miss rates for gcc and li are 10% and 2%,
respectively. Their micro-TLB behavior can be explained by considering the structure of the
programs. Gcc, for example, has a large amount of code and tends to make many nested
procedure calls. We believe that the observed behavior results from there being eight or more
procedures involved in most of gcc's localities, and that these procedures tend to be spread over
more than eight pages.

The SPECmarks were also simulated for micro-TLBs using an instruction page of size 16K 
bytes. The results are shown in Figure 3. At this page size, with 7 micro-TLB entries, the 
working set of virtually all the programs appears to have been reached, with the exception of 
gcc. The amount of memory fragmentation induced by the change to 16K byte pages can be 
inferred from the change in TLB resource demands of the programs. For example, eqntott uses
4 x 4K = 16K bytes of instruction memory with 4K pages, and 48K bytes with 16K byte pages.
Spice grows from 28K to 64K bytes. Fragmentation for 16K byte pages could be reduced with 
compilers and loaders that used heuristics or feedback information to locate the most active code 
adjacent in memory. 

One issue which turned out to be uninteresting is the effect of context switches on the micro- 
TLB. One alternative in simulations is to be pessimistic about preserving the contents of the 
micro-TLB, and flush it after every context switch. An optimistic alternative is to preserve the 
contents of the micro-TLB between context switches. It was found in the optimistic simulations 
that behavior improved in regions of already good behavior, but that there was little change 
where behavior was poor. In regions of good behavior, all micro-TLB misses are compulsory. 
In the optimistic simulations, only one compulsory miss is taken for each referenced page. In the 
pessimistic simulations, the micro-TLB is flushed, and so a compulsory miss occurs for every 
context switch. Preserving the micro-TLB reduces the number of compulsory misses, hence im- 
proving good behavior. During poor micro-TLB behavior, the vast majority of misses are not 
compulsory but capacity misses, and so changing the number of compulsory misses has a negli- 
gible effect on the overall results. The simulations used to generate data for this paper use the 
pessimistic model. 

Another micro-TLB design parameter that was considered is replacement policy. For two 
entries, LRU is easily implemented in hardware. For more than two entries, hardware LRU 
becomes more difficult. An interesting alternative for micro-TLBs is least recently replaced. 
This has the advantage of a relatively straightforward hardware implementation as a first-in first- 
out queue. FIFO is used in this exposition to refer to this replacement policy. 
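
The difference between the two policies is only whether a hit updates the replacement order. The following sketch counts misses for a fully associative TLB under either policy; it is an illustration with hypothetical names, not the simulator used for the results reported here.

```python
from collections import OrderedDict


def count_misses(page_refs, entries, policy):
    """Count misses in a fully associative TLB under 'lru' or 'fifo' replacement."""
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    resident = OrderedDict()  # the front of the dict is the next victim
    misses = 0
    for page in page_refs:
        if page in resident:
            if policy == "lru":
                resident.move_to_end(page)  # a hit refreshes recency under LRU
            continue                        # FIFO ignores hits entirely
        misses += 1
        if len(resident) == entries:
            resident.popitem(last=False)
        resident[page] = True
    return misses


refs = [1, 2, 3, 1, 4, 1, 5, 1]
print(count_misses(refs, 3, "lru"), count_misses(refs, 3, "fifo"))
```

In this small reference string page 1 is reused often; LRU keeps it resident, while FIFO evicts it once because it was the first page loaded.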

Figure 3: SPECmark micro-TLB Behavior

FIFO replacement, 16K byte pages

Figure 4 shows relative performance of LRU and FIFO replacement policies for the ten
SPECmarks. LRU is uniformly better, but not by a large amount. Again, the log scale makes it
possible to compare all the benchmarks in the same graph, although it tends to obscure the real
difference in performance, sometimes as much as a factor of two. Nonetheless, FIFO
performance is always comparable with LRU performance when the overall effect on CPI is
considered, establishing it as an interesting replacement policy for micro-TLBs of size greater than
two.

Simulations were also run to compare LRU, FIFO and Random replacement policies in full 
size data TLBs. It was found that FIFO performs uniformly better than Random, failing only in 
pathological worst case situations. For most machines with hardware to support Random
replacement, FIFO can be easily implemented, although it is expected that Random replacement
will be used to avoid pathological worst case behavior.

Figure 4: LRU vs. FIFO Replacement

4 entry micro-TLB, 4K byte pages 

Table 1 shows miss rates for the ten SPECmarks, plus tree and magic. Under the assumption 
of a 3 cycle miss penalty, figures in bold-face indicate where the micro-TLB penalty is greater 
than 0.01 CPI. Again, with 4K byte pages, it is observed that most of the SPECmarks achieve 
reasonable performance with a two entry micro-TLB, although improvement is observed with 
larger sizes. Figure 5 shows how data from Table 1 can be used to estimate CPI contribution for 
a given SPECmark, micro-TLB configuration, and miss penalty. 

Note that the micro-TLB performance for tree is particularly poor. With 4K byte pages, a two
entry micro-TLB absorbs about 0.10 CPI. Micro-TLB performance for tree begins to improve
rapidly beyond four micro-TLB entries. There are two effects that could conceivably be
contributing to the degraded performance: the working set size of the computation, and conflicts
between the garbage collector and the rest of the computation. The garbage collector and the
program behave as independent co-routines or threads in the single address space. In the case of
tree, most of the observed miss rate is due to locality properties of the compiled Scheme code, as
co-routine exchanges between the garbage collector and the program execution are much too
infrequent* to account for the observed miss rates.

Table 1: SPECmark FIFO micro-TLB Miss Rates

2 entry FIFO micro-TLB with 4K byte page
3 cycle miss penalty
CPI Contribution = 3 * 0.0133 = 0.0399 CPI

Figure 5: Estimating micro-TLB CPI Contribution for gcc

Although multi-thread effects were not important with tree, it is worth noting that, inasmuch
as threads tend to execute in independent code localities, tightly coupled threads executing in a
single address space will lead to degraded micro-TLB performance. This effect will be
exaggerated with very small micro-TLBs.

*The tree execution takes about 147 seconds, of which 2 seconds are spent collecting garbage.
Garbage collections were observed to occur every 8-15 seconds.

3.2. Instruction TLBs

We simulated instruction TLBs with sizes in powers of two from 8 to 64 entries, and page
sizes in powers of two from 4K to 64K bytes. In examples used to illustrate TLB performance, an
approximation of 100 cycles is used for the TLB miss penalty. It is assumed that a TLB miss
will cause two memory references, corresponding to a page table lookup, and that the latency of
these memory references will dominate the miss penalty. The penalty of 100 cycles corresponds
to a futuristic microprocessor with a cycle time of under five nanoseconds, and is meant to be 
somewhat less than the time for two references to main memory and somewhat more than the 
time for two references to an off chip cache. Under these assumptions, with TLB performance 
of 10000 instructions per miss, 0.01 CPI is lost to the TLB. This is used as a somewhat arbitrary 
lower bound on reasonable TLB performance. 

The behavior of single-benchmark workloads in instruction-only TLBs is mostly uninterest- 
ing, as gcc is the only SPECmark that presents a significant demand on resources. Figure 6 
illustrates the behavior for gcc. Again log scales are used. Solid lines connect TLB configura- 
tions with the same page size. Dashed lines connect TLBs that map the same amount of 
memory. The amount of memory mapped by a TLB will be referred to as its mapping size, to 
discriminate between that measure of size and others, such as the number of lines in a TLB. 
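
Mapping size is simply the number of entries multiplied by the page size, so different configurations can map the same amount of memory. The sketch below groups the instruction TLB configurations simulated here by mapping size, which corresponds to the sets of points joined by dashed lines; the function name is hypothetical.

```python
from collections import defaultdict

KB = 1024


def configs_by_mapping_size(entry_counts, page_sizes_bytes):
    """Group (entries, page_bytes) pairs by the amount of memory they map."""
    groups = defaultdict(list)
    for entries in entry_counts:
        for page_bytes in page_sizes_bytes:
            groups[entries * page_bytes].append((entries, page_bytes))
    return dict(groups)


groups = configs_by_mapping_size(
    [8, 16, 32, 64],
    [4 * KB, 8 * KB, 16 * KB, 32 * KB, 64 * KB],
)
for entries, page_bytes in groups[512 * KB]:
    print(f"{entries} entries x {page_bytes // KB}K byte pages")
```

For a mapping size of 512K bytes this lists four configurations, from 8 entries with 64K byte pages to 64 entries with 8K byte pages.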

As with the micro-TLB, there are three regimes of behavior to be observed. The placement of 
data points is more compact in the lower portion of the graph, with relatively little improvement 
for larger TLB configurations. This represents thrashing, trying to operate with TLB resources 
well below the working set size of the program. In the next region, performance improves 
quickly as working set size is approached. This figure shows that gcc approaches the TLB 
resources to map its working set with an instruction TLB mapping size of 512K bytes. Once the 
working set size of the program has been reached, increasing the mapping size has a reduced 
effect on performance, and the points again become more closely spaced. Such behavior is ob- 
served at the top of the graph. These three regimes of behavior become more pronounced for 
shared and data TLBs, as in Figure 8. 

Note that in the lower part of the graph, the dashed lines that connect TLBs with the same 
mapping size tend to slope upward slightly, while at the top of the graph they slope down. An 
application that uses sparse, non-contiguous data tends to have better TLB performance with
a larger number of smaller pages than with a few larger pages. Such behavior is observed in the
lower part of the graph. In the upper part of the graph, the dashed lines tend to slope down, 
hence better performance with fewer larger pages. As a TLB becomes large enough to map all 
of a program's working set, smaller pages mean that TLB misses occur for each of several small 
pages, rather than once for a single large page. Similar behavior is observed when a program 
accesses contiguous data, as in Figure 8. 

Figure 7 illustrates instruction TLB performance for the multi-task mix. For this workload,
the flat dashed lines suggest that TLB performance is determined entirely by mapping size,
and that the configuration of page size and number of entries has little effect. TLB performance
crosses the 10000 instructions per miss performance boundary at a mapping size of 512K bytes,
and appears to have reached the working set size for mapping sizes over 2 megabytes.

Figure 6: Gcc Instruction TLB Behavior

Random Replacement, Fully Associative 

Table 2 gives miss rates for selected benchmarks for a number of instruction-only TLB con- 
figurations. The benchmarks not included have very low miss rates. 

Our experiments suggest that an instruction TLB with 16 entries and 16K byte pages is ade- 
quate for most application workloads. One shortcoming of these experiments is that they do not 
consider the effects of operating system instruction references. It is known that system code 
generally exhibits worse instruction locality than applications. When designing an architecture 
to support a specific operating system, this should be a consideration. 

Figure 7: Multi-task instruction TLB behavior

Random Replacement, Fully Associative

Table 2: Instruction TLB Miss Rates

Random Replacement, Fully Associative

4. Data and Shared TLB Results

Data-only TLBs and shared TLBs behave similarly. Most programs have well behaved in- 
struction reference patterns, so in the shared case behavior is dominated by data references. Al- 
though the following discussion is in terms of shared TLBs, the analysis can be applied to the 
data-only case as well. Shared TLBs were simulated with sizes in powers of two from 32 to 256 
entries, page sizes in powers of two from 4K to 64K bytes, and with LRU and Random replace- 
ment policies. All TLBs simulated were fully associative. 




Figure 8: Mat300 Shared TLB Behavior (x axis: TLB entries, 32 to 256)

Figure 8 shows the shared TLB behavior for mat300, which was found to be the SPECmark
with the worst TLB performance. Note that the TLBs in this figure are a factor of four larger
than those considered with instruction TLBs, ranging from 32 to 256 entries as opposed to 8 to
64 entries. With a miss penalty of 100 cycles and a 64 entry TLB with 4K byte pages, mat300
spends 5 CPI on the TLB. Mat300 does matrix operations on three matrices with a total size of
approximately 2.5 megabytes. The contents of these three matrices are accessed in regular
patterns, sometimes sequentially and sometimes stepping by columns. This explains the poor
behavior when the TLB maps less than 2.5 megabytes, and rapid improvement as that barrier is
reached and surpassed. Observe that lines connecting TLBs with the same mapping size always
slope down, consistent with the observation that mat300 accesses its data contiguously.




Figure 9: Tree Shared TLB Behavior

Random Replacement, Fully Associative

Figure 9 shows the TLB behavior for tree running with a 10 megabyte heap. Memory is
allocated from 5 megabytes of the heap, while the other 5 megabytes is reserved for garbage
collection. As expected, performance improves steadily through a mapping size of 5 megabytes
(16K pages x 256 entries), after which the rate of improvement begins to diminish, shown by the
downward sloping lines connecting configurations with equal mapping sizes. Note that below
the 5 megabyte boundary, the lines connecting TLBs with the same mapping size are nearly
horizontal. This indicates that TLB performance for this benchmark depends exclusively on the
mapping size. Other TLB parameters are unimportant.

As a last illustration of shared TLB behavior, Figure 10 shows a plot for the multi-task mix. 
Several conclusions are immediate. First, as for tree, the mapping size of the TLBs is the 
dominant factor in performance. Only after a mapping size of eight megabytes do the dashed 
lines stop looking horizontal. Also, most of the plot is fairly compact. Data points become less 
compact beyond a mapping size of eight megabytes, as the working set size is approached. 
Lastly, for all TLBs with a mapping size of one megabyte or less, assuming a 100 cycle miss 
penalty, at least 0.05 CPI is lost to the TLB, a significant performance penalty. This suggests 
that if a machine with a TLB is to execute such a workload efficiently, the TLB must have a 
significantly larger mapping size. 




Figure 10: Multiprocess Mix TLB Behavior 

Random Replacement, Fully Associative 

Table 3 shows TLB miss rates for the ten SPECmarks, plus several other workloads of
interest. This table is highlighted to indicate where the TLB penalty is more than 0.01 CPI. With the
exceptions of nasa7 and mat300, both of which are oriented towards scientific/vector machines,
the SPECmarks perform better than this with a 64 x 16K TLB, suggesting that such a
configuration provides adequate TLB performance. The smaller configurations don't perform as well.
This suggests that sometime in the near future, TLBs with larger mapping sizes will be needed,
especially for machines running numeric programs.



Table 4 gives miss rates for a number of data-only TLB configurations. 

Table 4: Data TLB Miss Rates

Random Replacement, Fully Associative



Figure 11 compares miss rates for shared and split TLBs. All TLBs use 64 entries and 4K
byte pages. For all but gcc and the multi-task mix, the instruction miss rates are inconsequential.
Note that the sum of the split instruction and data miss rates is generally less than the miss rate
for the shared TLB. This difference represents competition for TLB entries between instruction
and data references. This comparison is not meant to suggest two specific implementation
alternatives, as the split TLBs illustrated use twice the resources of the shared TLB.




Figure 11: Shared vs. Split TLBs

64 entries, 4K byte pages
Random Replacement, Fully Associative

5. Variable Size TLB Entries 

An interesting question for future work is how to make use of the variable size TLB entries
that have appeared in recent architectures [5, 7]. Maps of the dynamic patterns of memory
access are useful to understand this problem. Figure 12 shows the pattern of data memory accesses
for mat300. Page address varies in the x dimension, from 0x10000000 on the left to 0x1021e000
on the right, a range of about 2.2 megabytes. Instruction count (i.e., time) varies along the y
dimension, ranging from 0 at the top to 2.63 billion at the bottom. The darkness of each square
corresponds to the number of accesses per 16K byte page during a 1000000 instruction interval.
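
A map of this kind can be built by binning trace records into (interval, page) cells. The sketch below assumes a trace of (instruction count, data address) pairs; that record format and the function name are hypothetical and are not those of the WRL tracing tools. The defaults follow the cell size quoted in the text above.

```python
from collections import Counter


def access_map(trace, page_bytes=16 * 1024, interval=1_000_000):
    """Count accesses per (instruction interval, page) cell.

    trace is an iterable of (instruction_count, data_address) pairs.
    """
    counts = Counter()
    for icount, addr in trace:
        counts[(icount // interval, addr // page_bytes)] += 1
    return counts


trace = [
    (0, 0x10000000),
    (10, 0x10000004),
    (999_999, 0x10004000),
    (1_000_000, 0x10000000),
]
for (row, page), n in sorted(access_map(trace).items()):
    print(row, hex(page * 16 * 1024), n)
```

Each cell count would then be rendered as a darkness value, with the interval index on the y axis and the page index on the x axis.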

The three matrices used by mat300 are clearly visible from the usage patterns in the address
space. The compactness and predictability of the mat300 accesses show that the use of larger
pages could virtually eliminate TLB misses, provided that adequate memory resources were
available.

mat300-data

instruction range: 0-2770000000
page range : 10000-1021e

block = 10000000 instructions x 0x2000 bytes

Figure 12: mat300 Data Memory Access Patterns

Tree, the Lisp benchmark, also shows interesting data reference patterns, illustrated in Figure
13. Note that a page size of 64K bytes was used. The address space represented in this figure is
about 11 megabytes. The descending staircase pattern shows the behavior of the memory
allocator as it walks across the heap. Solid vertical bands show where garbage collection has
compacted the heap into frequently accessed regions. The pattern of memory references for tree
is sparse relative to mat300. This, along with the size of the address space, suggests that Lisp
workloads such as tree are relatively poor candidates for variable size pages.

tree-data

instruction range: 0-2410000000 
page range : lOOOO-lOaSc 

block = 10000000 instructions x 0x10000 bytes 



Figure 13: tree Data Memory Access Patterns 

Interesting patterns of reference are the exception rather than the rule in memory access pat- 
terns. Most of the benchmarks concentrate on a small number of unclustered pages, resulting in 
a few dark vertical bars from the top to the bottom of the map, with occasional horizontal excur- 
sions. 

Figure 14 shows a map of instruction references for gcc. Each point represents one or more
references to a 4K byte page during an interval of 100000 instructions. The address space
spanned in this figure is 684K bytes, the largest text segment of any of the SPECmarks. The
number of different pages touched during a single 100000 instruction interval illustrates clearly
why gcc places high demands on the TLB. If variable size memory pages were to be used to
improve gcc performance, the only solution would be to load the entire program text into a
contiguous segment.

For instruction references, compilers might use feedback information on performance critical
applications to locate active text contiguously, making the use of a single larger TLB entry a
more attractive option. Such techniques are more difficult to apply to data references, as heap
structures are allocated dynamically, and so their location is not under the control of the
compiler. With the relocatable nature of lisp data, it might be possible to tune garbage
collectors to improve the locality of reference. For uncollected memory allocation schemes, a
tool using feedback information could suggest how to order heap data allocation to improve
contiguity of data.

Figure 14: gcc Instruction Memory Access Patterns
(gcc-inst; instruction range: 0-22700000; page range: 400-4a7;
block = 100000 instructions x 0x1000 bytes)

6. Conclusions 

This study has investigated the performance of one- and two-level instruction TLBs, data
TLBs, and shared TLBs, as well as analyzing the potential performance implications of
variable-sized pages. In contrast to previous studies, this work concentrated on
fully-associative TLB organizations and split instruction and data reference streams.

For instruction TLBs, programs such as gcc and li that make many nested calls to small
procedures are the hardest to satisfy. For most of the SPECmarks, 4K byte pages and a two
entry micro-TLB (whose misses are serviced in several cycles by a shared TLB) perform
reasonably well. For example, with a 3 cycle micro-TLB miss penalty (i.e., assuming that the
reference hits in the 2nd-level TLB) all SPECmarks except gcc and li incur a CPI of less than
0.03 due to micro-TLB misses. Both gcc and li can achieve this level of performance with
4-entry micro-TLBs, but incur a CPI penalty of about 0.06 with a 2-entry micro-TLB. A FIFO
replacement policy performs almost as well as LRU for micro-TLBs.
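
The FIFO versus LRU comparison and the CPI arithmetic can be illustrated with a small sketch. It is not the simulator used in this study: the function names, the synthetic page sequence, and the modelling of the micro-TLB as a fully associative structure over a bare list of page numbers are all assumptions made for illustration.

```python
from collections import OrderedDict

def micro_tlb_misses(pages, entries, policy="fifo"):
    """Count misses for a small fully associative TLB (FIFO or LRU)."""
    tlb = OrderedDict()  # page numbers, oldest entry first
    misses = 0
    for page in pages:
        if page in tlb:
            if policy == "lru":
                tlb.move_to_end(page)  # only LRU refreshes an entry on a hit
            continue
        misses += 1
        if len(tlb) >= entries:
            tlb.popitem(last=False)  # evict oldest (FIFO) or least recent (LRU)
        tlb[page] = True
    return misses

def cpi_loss(misses, instructions, penalty_cycles):
    """CPI contribution = misses per instruction * cycles per miss."""
    return misses / instructions * penalty_cycles

# Synthetic sequence of instruction page numbers, one per reference.
pages = [1, 2, 1, 3, 1, 2]
fifo = micro_tlb_misses(pages, entries=2, policy="fifo")
lru = micro_tlb_misses(pages, entries=2, policy="lru")
print(fifo, lru)  # 5 4
```

Under this model, a CPI loss of 0.03 at a 3 cycle penalty corresponds to about one micro-TLB miss per 100 instructions.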

In single-level instruction, data, and shared TLBs, TLB performance is usually dominated by
how much memory is mapped. Single-level fully-associative instruction TLBs (or the second
level of a two-level organization) with more than 32 entries, 4K byte pages, and a 100 cycle
miss penalty incur CPIs of under 0.1 even for gcc. Performance on other benchmarks and with
larger TLBs is better. With the larger capacities and miss penalties of full size instruction TLBs,
multi-tasking and system effects also become important.

A data or shared TLB mapping 256K bytes in 4K byte pages (i.e., 64 entries) with a 100 cycle
miss penalty incurs 0.1 CPI or less for all of the SPECmarks except nasa7 and mat300. Both of
these are scientific/vector oriented programs with large data sets. Furthermore, column access
(i.e., non-unit stride) can result in successive data references to successive pages, which is
disastrous for TLB performance unless the entire data set is mapped at the same time. The
nasa7 and mat300 benchmarks incur CPIs of 1.7 and 4.9, respectively, for the TLB parameters
given above. This is not reduced to under 0.1 CPI for mat300 until the TLB can map 2
megabytes (e.g., a 256 entry TLB with 8K byte pages). Work with more demanding workloads
suggests that future TLBs must map significantly more memory.
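
The figures in this paragraph follow from two simple relations: mapped memory is the number of entries times the page size, and CPI loss is misses per instruction times the miss penalty. The sketch below only restates that arithmetic; the function names are illustrative and do not come from the study's tools.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_bytes):
    """Bytes of memory a TLB can map at once."""
    return entries * page_bytes

def misses_per_instruction(cpi_loss, penalty_cycles):
    """Invert CPI loss = misses per instruction * miss penalty."""
    return cpi_loss / penalty_cycles

print(tlb_reach(64, 4 * KB) // KB)   # 256 (K bytes)
print(tlb_reach(256, 8 * KB) // MB)  # 2 (megabytes)
print(misses_per_instruction(4.9, 100))
```

With a 100 cycle penalty, a CPI loss of 4.9 corresponds to roughly one TLB miss for every 20 instructions executed.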

One way to increase the amount of memory mapped without requiring an unreasonably large
number of TLB entries is the use of variable-sized pages. Memory access plots suggest that the
use of very large pages (e.g., 256K byte or greater) for the data space of mat300 and tree, and the
instruction space of gcc could vastly reduce the size of the TLB required for good performance
while decreasing its miss rate.
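
As a rough illustration of the entry savings, the sketch below counts how many pages of a given size are needed to cover the data address range plotted in Figure 12 (0x10000000 to 0x1021e000). It assumes a single uniform page size over the whole region, treats the upper bound as exclusive, and ignores whether the operating system could supply contiguous physical memory; the helper name `entries_needed` is hypothetical.

```python
def entries_needed(start, end, page_bytes):
    """Pages of size page_bytes needed to cover the byte range [start, end).

    Partially covered pages at either end are counted as whole pages.
    """
    first = start // page_bytes
    last = (end - 1) // page_bytes
    return last - first + 1

START, END = 0x10000000, 0x1021E000  # range taken from the Figure 12 description

for page_kb in (4, 8, 64, 256):
    print(page_kb, entries_needed(START, END, page_kb * 1024))
# 4 542
# 8 271
# 64 34
# 256 9
```

Under these assumptions, the range that needs several hundred entries with 4K byte pages fits in fewer than ten entries with 256K byte pages.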

One significant shortcoming of this TLB analysis is the inability to consider operating system 
effects. We are currently involved in completing a new tracing system which includes system 
traces, with the intention of performing a thorough exploration of operating system memory be- 
havior on modern RISC processors. 